Exploring a quantity, broken down into > 2 subgroups
Visualization gallery: comparison boxplot, faceted histograms, density and ridgeline plots
Assessing the assumptions of regression (ANOVA) with residual plots
Intro: Using simple (single) imputation to deal with missing data
An Example to Work through on your own
Today’s Setup
knitr::opts_chunk$set(comment=NA)library(broom) ## tidy, glance and (new) augment!library(ggridges) ## help with ridgeline plotslibrary(kableExtra) ## tidy up tables of outputlibrary(janitor)library(mosaic) ## for favstatslibrary(naniar)library(simputation) ## enable single imputation of NAslibrary(patchwork)library(tidyverse)theme_set(theme_bw())
Today’s Data
Today, we’ll use an R data set (.Rds) to import data.
bs_dat <-read_rds("c11/data/blood_storage.Rds")
This allows us to read in the data just as they were last saved in R, including “factoring”, etc.
readRDS() also works but is a little slower.
To write an R data set, use write_rds(datasetname, "locationoncomputer").
saveRDS() would also work, but slower.
The blood storage data set
This study1 evaluates the association between red blood cells (RBC) storage duration (categorized into three groups) and time (in months) to biochemical prostate cancer recurrence after radical prostatectomy.
In cancer patients, perioperative blood transfusion has long been suspected of reducing long-term survival, and it is suspected that cancer recurrence may be worsened after the transfusion of older blood.
ggplot(bs_cc, aes(x = age_group, y = recur_time)) +geom_point() +geom_smooth(method ="lm", se =FALSE)
Visualizing Strategies
We’re trying to look at the impact of age_group on recur_time.
Comparison Boxplot
Faceted Histograms
Overlapping Density Plot
Ridgeline Plot
So let’s walk through each of these.
Comparison Boxplot
ggplot(data = bs_cc, aes(x = age_group, y = recur_time)) +geom_violin() +geom_boxplot(aes(fill = age_group), width =0.3, notch =TRUE, outlier.size =2) +guides(fill ="none") +coord_flip() +scale_fill_viridis_d(alpha =0.5) +labs(y ="Recurrence Time (in months)",x ="Red Blood Cell age group",title ="Recurrence Time by RBC Age Group")
Comparison Boxplot
Add MEANS to Comparison Boxplot
ggplot(data = bs_cc, aes(x = age_group, y = recur_time)) +geom_violin() +geom_boxplot(aes(fill = age_group), width =0.3, notch =TRUE, outlier.size =2) +stat_summary(fun ="mean", geom ="point", shape =23, size =3, fill ="white") +guides(fill ="none") +coord_flip() +scale_fill_viridis_d(alpha =0.5) +labs(y ="Recurrence Time (in months)",x ="Red Blood Cell age group",title ="Recurrence Time by RBC Age Group",caption ="Diamonds indicate sample means")
Add MEANS to Comparison Boxplot
Faceted Histograms
ggplot(data = bs_cc, aes(x = recur_time, fill = age_group)) +geom_histogram(bins =20, col ="navy") +guides(fill ="none") +facet_grid(age_group ~ .) +labs(x ="Recurrence Time (in months)",title ="Recurrence Time by RBC Age Group")
Faceted Histograms
Comparing Densities
ggplot(data = bs_cc, aes(x = recur_time, fill = age_group)) +geom_density() +scale_fill_viridis_d(alpha =0.5, option ="A") +labs(title ="Time to Recurrence, by RBC Age Group")
Using a Ridgeline Plot
ggplot(data = bs_cc, aes(x = recur_time, y = age_group, fill = age_group)) +geom_density_ridges(alpha =0.5) +guides(fill ="none") +labs(title ="Time to Recurrence, by RBC Age Group")
Model Recurrence Time using Age Group
Again, we’re using the complete cases here, from bs_cc.
the regression relationship is linear, rather than curved, and we can assess this by plotting the regression residuals (prediction errors) against the fitted values and looking to see if a curve emerges.
Do we see a curve in the plot we draw next?
Plot residuals vs. fitted values from m1
ggplot(m1_aug, aes(x = .fitted, y = .resid)) +geom_point() +geom_smooth(method ="lm", col ="red",formula = y ~ x, se =FALSE) +geom_smooth(method ="loess", col ="blue",formula = y ~ x, se =FALSE) +labs(title ="m1: Residuals vs. Fitted Values", x ="Fitted sbp values", y ="Residuals")
Plot residuals vs. fitted values from m1
Linear Model Assumptions?
We assume that:
the regression residuals show similar variance across levels of the fitted values, and again we can get insight into this by plotting residuals vs. predicted values.
Do we see a fan shape in the plot we draw next?
Does the variation change materially as we move from left to right?
Plot residuals vs. fitted values from m1
A Fuzzy Football
What we want to see in the plot of residuals vs. fitted values is a “fuzzy football.”
Linear Model Assumptions?
We assume that:
the regression residuals (prediction errors) are well described by a Normal model, and we can assess this with all of our usual visualizations to help decide on whether a Normal model is reasonable for a batch of data.
Do the residuals from our model appear to follow a Normal distribution?
favstats(~ .resid, data = m1_aug) |>kbl(digits =1) |>kable_styling(font_size =24)
min
Q1
median
Q3
max
mean
sd
n
missing
-34
-25.5
-6.7
18.5
72.9
0
28.4
289
0
Alternative m1 residual plots?
par(mfrow =c(1,2)); plot(m1, which =1:2); par(mfrow =c(1,1))
Imputation
Dealing with the Missing Data
We have done all analyses on complete cases, but that’s not always wise.
What if doing so would bias our conclusions?
Here we have two missing age_group values and one missing recur_time.
It’s scary to estimate these missing values. What could we do?
Single Imputation
In single imputation analyses, NA values are estimated/replaced one time with one particular data value for the purpose of obtaining more complete samples, at the expense of creating some potential bias in the eventual conclusions or obtaining slightly less accurate estimates than would be available if there were no missing values in the data.
The simputation package can help us execute single imputations using a wide variety of techniques, within the pipe approach used by the tidyverse.
See Section 9.8 of the Course Notes for some additional examples.
Estimate missing values?
bs_dat |>select(-participant) |>summary()
age_group units recur_time
younger:97 Min. :1.000 Min. : 0.270
middle :98 1st Qu.:2.000 1st Qu.: 7.685
older :95 Median :2.000 Median : 26.690
NA's : 2 Mean :2.048 Mean : 33.297
3rd Qu.:2.000 3rd Qu.: 52.685
Max. :4.000 Max. :103.600
NA's :1
Which values are missing and must be imputed?
Create an imputation model
The simputation package is our friend here. We’ll use
But these two models do not have the same outcomes. Why not? What does this mean about comparing the glance results?
Three Types of Missingness
MCAR = Missingness completely at random.
A variable is missing completely at random if the probability of missingness is the same for all units, for example, if for each subject, we decide whether to collect data on a measure by rolling a die and refusing to answer if a “6” shows up. If data are missing completely at random, then throwing out cases with missing data (i.e. doing a complete case analysis) does not bias your inferences.
Three Types of Missingness
MAR = Missingness at random.
Missingness that depends only on observed predictors. A more general assumption, called missing at random or MAR, is that the probability a variable is missing depends only on available information. Here, we would have to be willing to assume that the probability of nonresponse to depends only on the other, fully recorded variables in the data.
Here is the situation that most obviously cries out for imputation.
Three Types of Missingness
Missing not at random
This is a bigger problem, and includes both:
Missingness that depends on unobserved predictors. Missingness is no longer “at random” if it depends on information that has not been recorded and this information also predicts the missing values.
Missingness that depends on the missing value itself. For example, suppose that people with higher earnings are less likely to reveal them.
OK, back to working with complete cases
Back to our Comparison Boxplot
Does comparing means make sense here?
Are these sample distributions “Normal-ish”?
Would a Transformation Help Us?
favstats(~ recur_time, data = bs_cc)
min Q1 median Q3 max mean sd n missing
0.27 7.6 25.3 52.07 103.6 32.89225 28.47644 289 0
Since all recur_time values are positive, we might look at:
\(log(time)\), or \(1/time\), or \(\sqrt{time}\), or \(time^2\), for example…
What are we hoping these transformations will do?
Boxplot 0: recur_time by age_group
Boxplot 1: log(recur_time) by age_group
Boxplot 2: 1/(recur_time) by age_group
Boxplot 3: \(\sqrt{time}\) by age_group
Code for Boxplot 3
ggplot(data = bs_cc, aes(x = age_group, y =sqrt(recur_time))) +geom_violin() +geom_boxplot(aes(fill = age_group), width =0.3, notch =TRUE, outlier.size =2) +stat_summary(fun ="mean", geom ="point", shape =23, size =3, fill ="white") +guides(fill ="none") +coord_flip() +scale_fill_viridis_d(alpha =0.5) +labs(y ="Square Root of Recurrence Time",x ="Red Blood Cell age group",title ="Square Root of Recurrence Time by RBC Age Group",caption ="Diamonds indicate sample means")
Ridgeline Plot for \(\sqrt{time}\)?
ggplot(data = bs_cc, aes(x =sqrt(recur_time), y = age_group, fill = age_group)) +geom_density_ridges(alpha =0.5) +guides(fill ="none") +labs(title ="Square Root of Time to Recurrence, by RBC Age Group")
Fit a Model to predict \(\sqrt{time}\)?
m2 <-lm(sqrt(recur_time) ~ age_group, data = bs_cc)m2